#pip install autogluon
Kaggle | Autogluon
Let's fit the Titanic data using AutoGluon, an automated prediction tool!
1. Library imports
import pandas as pd
import numpy as np
## Import the module for working with tabular (table-format) data.
from autogluon.tabular import TabularDataset, TabularPredictor
C:\Users\hollyriver\anaconda3\envs\py\lib\site-packages\tqdm\auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
from .autonotebook import tqdm as notebook_tqdm
2. Analysis
A. Load the data
This step is like being handed the problem set.
tr = TabularDataset('./data/train.csv') ## training data
tst = TabularDataset('./data/test.csv')
## tr = TabularDataset('/kaggle/input/titanic/train.csv') ## training data
## tst = TabularDataset('/kaggle/input/titanic/test.csv')
## tr = pd.read_csv('/kaggle/input/titanic/train.csv')
## tst
### B. Create the Predictor
This step is like creating the student who will solve the problems.
predictr = TabularPredictor('Survived') ## pass the name of the target (label) column; the variable name predictr is deliberately misspelled
No path specified. Models will be saved in: "AutogluonModels\ag-20231017_130536"
What is predictr?
type(predictr)
autogluon.tabular.predictor.predictor.TabularPredictor
It appears to be a class defined in AutoGluon.
C. Fit
This corresponds to the learning process.
predictr.fit(tr) ## give the student (predictr) the problems (tr) and have it learn
## fit trains every model it can on tr as-is, which differs from fitting a single sklearn model
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels\ag-20231017_130536"
AutoGluon Version: 0.8.2
Python Version: 3.10.13
Operating System: Windows
Platform Machine: AMD64
Platform Version: 10.0.19045
Disk Space Avail: 57.71 GB / 255.01 GB (22.6%)
Train Data Rows: 891
Train Data Columns: 11
Label Column: Survived
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
2 unique label values: [0, 1]
If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping: class 1 = 1, class 0 = 0
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
Available Memory: 1930.98 MB
Train Data (Original) Memory Usage: 0.31 MB (0.0% of available memory)
Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
Stage 1 Generators:
Fitting AsTypeFeatureGenerator...
Note: Converting 1 features to boolean dtype as they only contain 2 unique values.
Stage 2 Generators:
Fitting FillNaFeatureGenerator...
Stage 3 Generators:
Fitting IdentityFeatureGenerator...
Fitting CategoryFeatureGenerator...
Fitting CategoryMemoryMinimizeFeatureGenerator...
Fitting TextSpecialFeatureGenerator...
Fitting BinnedFeatureGenerator...
Fitting DropDuplicatesFeatureGenerator...
Fitting TextNgramFeatureGenerator...
Fitting CountVectorizer for text features: ['Name']
CountVectorizer fit with vocabulary size = 8
Stage 4 Generators:
Fitting DropUniqueFeatureGenerator...
Stage 5 Generators:
Fitting DropDuplicatesFeatureGenerator...
Types of features in original data (raw dtype, special dtypes):
('float', []) : 2 | ['Age', 'Fare']
('int', []) : 4 | ['PassengerId', 'Pclass', 'SibSp', 'Parch']
('object', []) : 4 | ['Sex', 'Ticket', 'Cabin', 'Embarked']
('object', ['text']) : 1 | ['Name']
Types of features in processed data (raw dtype, special dtypes):
('category', []) : 3 | ['Ticket', 'Cabin', 'Embarked']
('float', []) : 2 | ['Age', 'Fare']
('int', []) : 4 | ['PassengerId', 'Pclass', 'SibSp', 'Parch']
('int', ['binned', 'text_special']) : 9 | ['Name.char_count', 'Name.word_count', 'Name.capital_ratio', 'Name.lower_ratio', 'Name.special_ratio', ...]
('int', ['bool']) : 1 | ['Sex']
('int', ['text_ngram']) : 9 | ['__nlp__.henry', '__nlp__.john', '__nlp__.master', '__nlp__.miss', '__nlp__.mr', ...]
0.3s = Fit runtime
11 features in original data used to generate 28 features in processed data.
Train Data (Processed) Memory Usage: 0.07 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.36s ...
AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'
To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 712, Val Rows: 179
User-specified model hyperparameters to be fit:
{
'NN_TORCH': {},
'GBM': [{'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}}, {}, 'GBMLarge'],
'CAT': {},
'XGB': {},
'FASTAI': {},
'RF': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
'XT': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
'KNN': [{'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}}, {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}}],
}
Fitting 13 L1 models ...
Fitting model: KNeighborsUnif ...
0.6536 = Validation score (accuracy)
1.83s = Training runtime
0.22s = Validation runtime
Fitting model: KNeighborsDist ...
0.6536 = Validation score (accuracy)
0.01s = Training runtime
0.01s = Validation runtime
Fitting model: LightGBMXT ...
0.8156 = Validation score (accuracy)
1.08s = Training runtime
0.01s = Validation runtime
Fitting model: LightGBM ...
0.8212 = Validation score (accuracy)
0.43s = Training runtime
0.01s = Validation runtime
Fitting model: RandomForestGini ...
0.8156 = Validation score (accuracy)
0.64s = Training runtime
0.06s = Validation runtime
Fitting model: RandomForestEntr ...
0.8156 = Validation score (accuracy)
0.54s = Training runtime
0.06s = Validation runtime
Fitting model: CatBoost ...
0.8268 = Validation score (accuracy)
7.47s = Training runtime
0.01s = Validation runtime
Fitting model: ExtraTreesGini ...
0.8156 = Validation score (accuracy)
0.52s = Training runtime
0.06s = Validation runtime
Fitting model: ExtraTreesEntr ...
0.8101 = Validation score (accuracy)
0.52s = Training runtime
0.06s = Validation runtime
Fitting model: NeuralNetFastAI ...
No improvement since epoch 9: early stopping
0.8324 = Validation score (accuracy)
3.28s = Training runtime
0.03s = Validation runtime
Fitting model: XGBoost ...
0.8101 = Validation score (accuracy)
1.32s = Training runtime
0.01s = Validation runtime
Fitting model: NeuralNetTorch ...
0.8212 = Validation score (accuracy)
7.74s = Training runtime
0.03s = Validation runtime
Fitting model: LightGBMLarge ...
0.8324 = Validation score (accuracy)
0.93s = Training runtime
0.01s = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
0.8324 = Validation score (accuracy)
0.67s = Training runtime
0.0s = Validation runtime
AutoGluon training complete, total runtime = 28.23s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels\ag-20231017_130536")
<autogluon.tabular.predictor.predictor.TabularPredictor at 0x20df4525d50>
Training is complete, so check the leaderboard (like grading a mock exam).
predictr.leaderboard()
| | model | score_val | pred_time_val | fit_time | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
|---|---|---|---|---|---|---|---|---|---|
| 0 | LightGBMLarge | 0.832402 | 0.009000 | 0.928348 | 0.009000 | 0.928348 | 1 | True | 13 |
| 1 | NeuralNetFastAI | 0.832402 | 0.028001 | 3.279027 | 0.028001 | 3.279027 | 1 | True | 10 |
| 2 | WeightedEnsemble_L2 | 0.832402 | 0.029001 | 3.949707 | 0.001000 | 0.670681 | 2 | True | 14 |
| 3 | CatBoost | 0.826816 | 0.006002 | 7.468165 | 0.006002 | 7.468165 | 1 | True | 7 |
| 4 | LightGBM | 0.821229 | 0.006004 | 0.432719 | 0.006004 | 0.432719 | 1 | True | 4 |
| 5 | NeuralNetTorch | 0.821229 | 0.031999 | 7.740229 | 0.031999 | 7.740229 | 1 | True | 12 |
| 6 | LightGBMXT | 0.815642 | 0.005002 | 1.084200 | 0.005002 | 1.084200 | 1 | True | 3 |
| 7 | ExtraTreesGini | 0.815642 | 0.061763 | 0.516426 | 0.061763 | 0.516426 | 1 | True | 8 |
| 8 | RandomForestEntr | 0.815642 | 0.064586 | 0.538568 | 0.064586 | 0.538568 | 1 | True | 6 |
| 9 | RandomForestGini | 0.815642 | 0.064718 | 0.637898 | 0.064718 | 0.637898 | 1 | True | 5 |
| 10 | XGBoost | 0.810056 | 0.013002 | 1.323482 | 0.013002 | 1.323482 | 1 | True | 11 |
| 11 | ExtraTreesEntr | 0.810056 | 0.062742 | 0.519159 | 0.062742 | 0.519159 | 1 | True | 9 |
| 12 | KNeighborsDist | 0.653631 | 0.005996 | 0.012003 | 0.005996 | 0.012003 | 1 | True | 2 |
| 13 | KNeighborsUnif | 0.653631 | 0.215770 | 1.826697 | 0.215770 | 1.826697 | 1 | True | 1 |
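What score_val means
* What did predictr actually learn from?
  > Suppose there is a predictor and a train set with 1,000 rows. The predictor does not use all of those rows for learning.
  > * If it learns from 800 rows, the remaining 200 are not used for learning; they are used to check its answers.
  > * Why hold out 200 rows?
  >
  > We want to test whether the rules it found for answering the problems are right and whether they generalize to other data, so the held-out rows are used for evaluation.
  >
  > This is a self-made test set for doing well on the real test. The 200 held-out rows are called the validation set.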
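The 800/200 idea described above can be sketched in a few lines. This is only an illustration using the standard library: `holdout_split` is a hypothetical helper, not an AutoGluon function, and AutoGluon's own split may differ in details such as how rows are sampled.

```python
import random

def holdout_split(rows, holdout_frac=0.2, seed=0):
    """Shuffle the rows and split them into (train, validation) lists.

    Hypothetical helper for illustration; not part of AutoGluon.
    """
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_val = round(len(shuffled) * holdout_frac)  # number of held-out rows
    return shuffled[n_val:], shuffled[:n_val]

rows = list(range(1000))                  # stand-in for 1,000 training rows
train_rows, val_rows = holdout_split(rows)
print(len(train_rows), len(val_rows))     # rows used for learning vs. for checking
```

The held-out rows never appear in the learning portion, which is what makes the validation score a fair check of generalization.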
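| | train | val |
|---|---|---|
| Student 1 | 95% | 72% |
| Student 2 | 80% | 80% |
| … | … | … |
A high score on val (the mock exam) is more meaningful than a high score earned by solving train (the practice problems) over and over.
So score_val can be read as the mock exam score.
- Therefore, let's use the WeightedEnsemble_L2 model, which received the highest score.[1]
[1] When I first ran this exercise, this one was definitely the highest…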
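As a small illustration of choosing by the mock exam score rather than the practice score, the sketch below uses the two students from the table. The dictionary layout and variable names are assumptions made for this example only.

```python
# Scores copied from the table above; the structure is illustrative.
scores = {
    "student1": {"train": 0.95, "val": 0.72},
    "student2": {"train": 0.80, "val": 0.80},
}

# Ranking by train score rewards memorizing the practice problems.
best_by_train = max(scores, key=lambda name: scores[name]["train"])

# Ranking by val score rewards doing well on unseen problems.
best_by_val = max(scores, key=lambda name: scores[name]["val"])

print(best_by_train, best_by_val)
```

The two rankings disagree here, which is the reason the leaderboard is sorted by score_val.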
### D. Predict
This step is like solving the problems after studying.
Previous analyses
- Predicting that every man dies and every woman survives: 0.7x / 0.76555
- Using RandomForestClassifier: 0.8x / 0.77511
- Tuning the hyperparameters of RandomForestClassifier: 0.8x / 0.76555 (the score on the train set was higher, but the actual result was lower.)
4. Use the WeightedEnsemble_L2 model (AutoGluon picks it automatically anyway)
Let's first solve the train set (predict).
type(tr) ## stored as an unfamiliar type, but every DataFrame feature still works on it
autogluon.core.dataset.TabularDataset
tr.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
(tr.Survived == predictr.predict(tr)).mean()
0.8810325476992144
The accuracy on the train set is about 0.8810, which looks quite promising.
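The expression above computes accuracy by comparing two Series element by element and averaging the resulting booleans. A minimal sketch with made-up labels (the values of `y_true` and `y_pred` are invented for illustration) shows the mechanics:

```python
import pandas as pd

# Made-up labels and predictions, for illustration only.
y_true = pd.Series([0, 1, 1, 0, 1])
y_pred = pd.Series([0, 1, 0, 0, 1])

matches = (y_true == y_pred)   # boolean Series: True where the prediction is right
accuracy = matches.mean()      # True counts as 1, so the mean is the hit rate
print(accuracy)
```

Note that an accuracy measured on the train set is computed mostly on rows the models already saw during fitting, so it tends to be higher than score_val on the leaderboard.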
predictr.predict(tst)
0 0
1 0
2 0
3 0
4 0
..
413 0
414 1
415 0
416 0
417 0
Name: Survived, Length: 418, dtype: int64
tst.assign(Survived = predictr.predict(tst)).loc[:, ['PassengerId', 'Survived']]\
.to_csv('autogluon_submission.csv', index = False)
The submission scored an accuracy of 0.78947, the highest so far.
3. Improvement
Can we improve the result a bit more?
A. Feature engineering with Fsize
1) Data
tr = TabularDataset('./data/train.csv') ## training data
tst = TabularDataset('./data/test.csv')
Loaded data from: ./data/train.csv | Columns = 12 / 12 | Rows = 891 -> 891
Loaded data from: ./data/test.csv | Columns = 11 / 11 | Rows = 418 -> 418
- Feature engineering
tr.assign(Fsize = tr.SibSp + tr.Parch)
tst.assign(Fsize = tst.SibSp + tst.Parch)
#tr.eval('Fsize = SibSp + Parch')
#tst.eval('Fsize = SibSp + Parch')
tr.head() ## assign leaves the original data unmodified; note that no Fsize column was added to tr
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
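The behavior seen in the table above, where `tr` has no Fsize column after calling `assign`, follows from `DataFrame.assign` returning a new object instead of modifying the original. A minimal sketch on a toy frame (the values are made up):

```python
import pandas as pd

# Toy data with the two columns used to build Fsize.
df = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# The result is discarded here, so df itself is unchanged.
df.assign(Fsize=df.SibSp + df.Parch)
print("Fsize" in df.columns)

# Keeping the returned frame (or passing it on directly) is what adds the column.
df_fe = df.assign(Fsize=df.SibSp + df.Parch)
print(df_fe.Fsize.tolist())
```

This is why the new column has to be created again, or the returned frame stored, at the point where the data is handed to fit and predict.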
2) Create the Predictor
predictr = TabularPredictor("Survived")
No path specified. Models will be saved in: "AutogluonModels\ag-20231017_132447"
3) Fit
predictr.fit(tr.assign(Fsize = tr.SibSp + tr.Parch)) ## train on the data with the new Fsize column added
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels\ag-20231017_132447"
AutoGluon Version: 0.8.2
Python Version: 3.10.13
Operating System: Windows
Platform Machine: AMD64
Platform Version: 10.0.19045
Disk Space Avail: 57.59 GB / 255.01 GB (22.6%)
Train Data Rows: 891
Train Data Columns: 12
Label Column: Survived
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
2 unique label values: [0, 1]
If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping: class 1 = 1, class 0 = 0
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
Available Memory: 1923.41 MB
Train Data (Original) Memory Usage: 0.32 MB (0.0% of available memory)
Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
Stage 1 Generators:
Fitting AsTypeFeatureGenerator...
Note: Converting 1 features to boolean dtype as they only contain 2 unique values.
Stage 2 Generators:
Fitting FillNaFeatureGenerator...
Stage 3 Generators:
Fitting IdentityFeatureGenerator...
Fitting CategoryFeatureGenerator...
Fitting CategoryMemoryMinimizeFeatureGenerator...
Fitting TextSpecialFeatureGenerator...
Fitting BinnedFeatureGenerator...
Fitting DropDuplicatesFeatureGenerator...
Fitting TextNgramFeatureGenerator...
Fitting CountVectorizer for text features: ['Name']
CountVectorizer fit with vocabulary size = 8
Warning: Due to memory constraints, ngram feature count is being reduced. Allocate more memory to maximize model quality.
Reducing Vectorizer vocab size from 8 to 4 to avoid OOM error
Stage 4 Generators:
Fitting DropUniqueFeatureGenerator...
Stage 5 Generators:
Fitting DropDuplicatesFeatureGenerator...
Types of features in original data (raw dtype, special dtypes):
('float', []) : 2 | ['Age', 'Fare']
('int', []) : 5 | ['PassengerId', 'Pclass', 'SibSp', 'Parch', 'Fsize']
('object', []) : 4 | ['Sex', 'Ticket', 'Cabin', 'Embarked']
('object', ['text']) : 1 | ['Name']
Types of features in processed data (raw dtype, special dtypes):
('category', []) : 3 | ['Ticket', 'Cabin', 'Embarked']
('float', []) : 2 | ['Age', 'Fare']
('int', []) : 5 | ['PassengerId', 'Pclass', 'SibSp', 'Parch', 'Fsize']
('int', ['binned', 'text_special']) : 9 | ['Name.char_count', 'Name.word_count', 'Name.capital_ratio', 'Name.lower_ratio', 'Name.special_ratio', ...]
('int', ['bool']) : 1 | ['Sex']
('int', ['text_ngram']) : 5 | ['__nlp__.miss', '__nlp__.mr', '__nlp__.mrs', '__nlp__.william', '__nlp__._total_']
0.3s = Fit runtime
12 features in original data used to generate 25 features in processed data.
Train Data (Processed) Memory Usage: 0.07 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.37s ...
AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'
To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 712, Val Rows: 179
User-specified model hyperparameters to be fit:
{
'NN_TORCH': {},
'GBM': [{'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}}, {}, 'GBMLarge'],
'CAT': {},
'XGB': {},
'FASTAI': {},
'RF': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
'XT': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
'KNN': [{'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}}, {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}}],
}
Fitting 13 L1 models ...
Fitting model: KNeighborsUnif ...
0.648 = Validation score (accuracy)
0.01s = Training runtime
0.01s = Validation runtime
Fitting model: KNeighborsDist ...
0.6425 = Validation score (accuracy)
0.01s = Training runtime
0.01s = Validation runtime
Fitting model: LightGBMXT ...
0.8268 = Validation score (accuracy)
0.43s = Training runtime
0.0s = Validation runtime
Fitting model: LightGBM ...
0.8492 = Validation score (accuracy)
0.55s = Training runtime
0.01s = Validation runtime
Fitting model: RandomForestGini ...
0.7989 = Validation score (accuracy)
0.51s = Training runtime
0.06s = Validation runtime
Fitting model: RandomForestEntr ...
0.8156 = Validation score (accuracy)
0.5s = Training runtime
0.06s = Validation runtime
Fitting model: CatBoost ...
0.8268 = Validation score (accuracy)
6.8s = Training runtime
0.01s = Validation runtime
Fitting model: ExtraTreesGini ...
0.8045 = Validation score (accuracy)
0.45s = Training runtime
0.07s = Validation runtime
Fitting model: ExtraTreesEntr ...
0.8045 = Validation score (accuracy)
0.44s = Training runtime
0.06s = Validation runtime
Fitting model: NeuralNetFastAI ...
0.8324 = Validation score (accuracy)
2.76s = Training runtime
0.02s = Validation runtime
Fitting model: XGBoost ...
0.8212 = Validation score (accuracy)
0.68s = Training runtime
0.01s = Validation runtime
Fitting model: NeuralNetTorch ...
0.8324 = Validation score (accuracy)
9.58s = Training runtime
0.03s = Validation runtime
Fitting model: LightGBMLarge ...
0.838 = Validation score (accuracy)
0.83s = Training runtime
0.01s = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
0.8492 = Validation score (accuracy)
0.65s = Training runtime
0.0s = Validation runtime
AutoGluon training complete, total runtime = 25.25s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels\ag-20231017_132447")
<autogluon.tabular.predictor.predictor.TabularPredictor at 0x20d8e852770>
- Check the leaderboard
predictr.leaderboard()
| | model | score_val | pred_time_val | fit_time | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
|---|---|---|---|---|---|---|---|---|---|
| 0 | LightGBM | 0.849162 | 0.010999 | 0.554687 | 0.010999 | 0.554687 | 1 | True | 4 |
| 1 | WeightedEnsemble_L2 | 0.849162 | 0.012000 | 1.201853 | 0.001000 | 0.647166 | 2 | True | 14 |
| 2 | LightGBMLarge | 0.837989 | 0.005001 | 0.827996 | 0.005001 | 0.827996 | 1 | True | 13 |
| 3 | NeuralNetFastAI | 0.832402 | 0.024005 | 2.761039 | 0.024005 | 2.761039 | 1 | True | 10 |
| 4 | NeuralNetTorch | 0.832402 | 0.034000 | 9.577353 | 0.034000 | 9.577353 | 1 | True | 12 |
| 5 | LightGBMXT | 0.826816 | 0.004981 | 0.426716 | 0.004981 | 0.426716 | 1 | True | 3 |
| 6 | CatBoost | 0.826816 | 0.005996 | 6.798872 | 0.005996 | 6.798872 | 1 | True | 7 |
| 7 | XGBoost | 0.821229 | 0.008010 | 0.680577 | 0.008010 | 0.680577 | 1 | True | 11 |
| 8 | RandomForestEntr | 0.815642 | 0.063459 | 0.504724 | 0.063459 | 0.504724 | 1 | True | 6 |
| 9 | ExtraTreesEntr | 0.804469 | 0.062692 | 0.443597 | 0.062692 | 0.443597 | 1 | True | 9 |
| 10 | ExtraTreesGini | 0.804469 | 0.065061 | 0.448723 | 0.065061 | 0.448723 | 1 | True | 8 |
| 11 | RandomForestGini | 0.798883 | 0.064575 | 0.510397 | 0.064575 | 0.510397 | 1 | True | 5 |
| 12 | KNeighborsUnif | 0.648045 | 0.007998 | 0.008998 | 0.007998 | 0.008998 | 1 | True | 1 |
| 13 | KNeighborsDist | 0.642458 | 0.007999 | 0.011001 | 0.007999 | 0.011001 | 1 | True | 2 |
4) Predict
(tr.Survived == predictr.predict(tr.assign(Fsize = tr.SibSp + tr.Parch))).mean()
0.9696969696969697
tst.assign(Survived = predictr.predict(tst.assign(Fsize = tst.SibSp + tst.Parch))).loc[:,['PassengerId','Survived']]\
.to_csv("autogluon(Fsize)_submission.csv",index=False)
Submission result: the score actually went down.
Let's improve it further.
### B. Fsize + drop
1) data
- Feature engineering (the data was already loaded above, so that step is skipped here)
_tr = tr.assign(Fsize = lambda _df : _df.SibSp + _df.Parch).drop(['SibSp','Parch'],axis=1)
_tst = tst.assign(Fsize = tst.SibSp + tst.Parch).drop(['SibSp','Parch'],axis=1)
_tr.head()
## df.drop(columns = [])
## df.drop([], axis = 1)  if neither columns= nor axis=1 is given, drop removes rows by default
| | PassengerId | Survived | Pclass | Name | Sex | Age | Ticket | Fare | Cabin | Embarked | Fsize |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | A/5 21171 | 7.2500 | NaN | S | 1 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38.0 | PC 17599 | 71.2833 | C85 | C | 1 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | STON/O2. 3101282 | 7.9250 | NaN | S | 0 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 113803 | 53.1000 | C123 | S | 1 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 373450 | 8.0500 | NaN | S | 0 |
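The note about `drop` defaulting to rows can be checked on a toy frame. The sketch below (made-up values) shows that `axis=1` and `columns=` give the same result, while omitting both makes pandas look for row labels instead:

```python
import pandas as pd

# Toy data; only the column names matter for this example.
df = pd.DataFrame({"SibSp": [1, 0], "Parch": [0, 2], "Fare": [7.25, 71.28]})

by_axis = df.drop(["SibSp", "Parch"], axis=1)       # drop along the column axis
by_columns = df.drop(columns=["SibSp", "Parch"])    # same thing, more explicit
print(by_axis.equals(by_columns))

# With the default axis=0, pandas searches the row index for these labels.
raised = False
try:
    df.drop(["SibSp", "Parch"])
except KeyError:
    raised = True
print(raised)
```

The `columns=` form is often easier to read because the intent is visible without remembering which axis number means what.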
2) Create the Predictor
predictr = TabularPredictor('Survived')
No path specified. Models will be saved in: "AutogluonModels\ag-20231017_132627"
3) Fit
predictr.fit(_tr)
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels\ag-20231017_132627"
AutoGluon Version: 0.8.2
Python Version: 3.10.13
Operating System: Windows
Platform Machine: AMD64
Platform Version: 10.0.19045
Disk Space Avail: 57.56 GB / 255.01 GB (22.6%)
Train Data Rows: 891
Train Data Columns: 10
Label Column: Survived
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
2 unique label values: [0, 1]
If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping: class 1 = 1, class 0 = 0
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
Available Memory: 1899.15 MB
Train Data (Original) Memory Usage: 0.31 MB (0.0% of available memory)
Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
Stage 1 Generators:
Fitting AsTypeFeatureGenerator...
Note: Converting 1 features to boolean dtype as they only contain 2 unique values.
Stage 2 Generators:
Fitting FillNaFeatureGenerator...
Stage 3 Generators:
Fitting IdentityFeatureGenerator...
Fitting CategoryFeatureGenerator...
Fitting CategoryMemoryMinimizeFeatureGenerator...
Fitting TextSpecialFeatureGenerator...
Fitting BinnedFeatureGenerator...
Fitting DropDuplicatesFeatureGenerator...
Fitting TextNgramFeatureGenerator...
Fitting CountVectorizer for text features: ['Name']
CountVectorizer fit with vocabulary size = 8
Warning: Due to memory constraints, ngram feature count is being reduced. Allocate more memory to maximize model quality.
Reducing Vectorizer vocab size from 8 to 4 to avoid OOM error
Stage 4 Generators:
Fitting DropUniqueFeatureGenerator...
Stage 5 Generators:
Fitting DropDuplicatesFeatureGenerator...
Types of features in original data (raw dtype, special dtypes):
('float', []) : 2 | ['Age', 'Fare']
('int', []) : 3 | ['PassengerId', 'Pclass', 'Fsize']
('object', []) : 4 | ['Sex', 'Ticket', 'Cabin', 'Embarked']
('object', ['text']) : 1 | ['Name']
Types of features in processed data (raw dtype, special dtypes):
('category', []) : 3 | ['Ticket', 'Cabin', 'Embarked']
('float', []) : 2 | ['Age', 'Fare']
('int', []) : 3 | ['PassengerId', 'Pclass', 'Fsize']
('int', ['binned', 'text_special']) : 9 | ['Name.char_count', 'Name.word_count', 'Name.capital_ratio', 'Name.lower_ratio', 'Name.special_ratio', ...]
('int', ['bool']) : 1 | ['Sex']
('int', ['text_ngram']) : 5 | ['__nlp__.miss', '__nlp__.mr', '__nlp__.mrs', '__nlp__.william', '__nlp__._total_']
0.3s = Fit runtime
10 features in original data used to generate 23 features in processed data.
Train Data (Processed) Memory Usage: 0.06 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.36s ...
AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'
To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 712, Val Rows: 179
User-specified model hyperparameters to be fit:
{
'NN_TORCH': {},
'GBM': [{'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}}, {}, 'GBMLarge'],
'CAT': {},
'XGB': {},
'FASTAI': {},
'RF': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
'XT': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
'KNN': [{'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}}, {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}}],
}
Fitting 13 L1 models ...
Fitting model: KNeighborsUnif ...
0.6536 = Validation score (accuracy)
0.01s = Training runtime
0.01s = Validation runtime
Fitting model: KNeighborsDist ...
0.648 = Validation score (accuracy)
0.02s = Training runtime
0.03s = Validation runtime
Fitting model: LightGBMXT ...
0.8212 = Validation score (accuracy)
0.47s = Training runtime
0.01s = Validation runtime
Fitting model: LightGBM ...
0.838 = Validation score (accuracy)
0.64s = Training runtime
0.01s = Validation runtime
Fitting model: RandomForestGini ...
0.8045 = Validation score (accuracy)
0.54s = Training runtime
0.06s = Validation runtime
Fitting model: RandomForestEntr ...
0.8101 = Validation score (accuracy)
0.53s = Training runtime
0.06s = Validation runtime
Fitting model: CatBoost ...
0.8324 = Validation score (accuracy)
7.6s = Training runtime
0.01s = Validation runtime
Fitting model: ExtraTreesGini ...
0.7989 = Validation score (accuracy)
0.53s = Training runtime
0.06s = Validation runtime
Fitting model: ExtraTreesEntr ...
0.8045 = Validation score (accuracy)
0.52s = Training runtime
0.06s = Validation runtime
Fitting model: NeuralNetFastAI ...
No improvement since epoch 9: early stopping
0.8268 = Validation score (accuracy)
1.95s = Training runtime
0.02s = Validation runtime
Fitting model: XGBoost ...
0.8268 = Validation score (accuracy)
0.45s = Training runtime
0.01s = Validation runtime
Fitting model: NeuralNetTorch ...
0.8436 = Validation score (accuracy)
10.87s = Training runtime
0.03s = Validation runtime
Fitting model: LightGBMLarge ...
0.8324 = Validation score (accuracy)
0.82s = Training runtime
0.01s = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
0.8492 = Validation score (accuracy)
0.65s = Training runtime
0.0s = Validation runtime
AutoGluon training complete, total runtime = 26.66s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels\ag-20231017_132627")
<autogluon.tabular.predictor.predictor.TabularPredictor at 0x20d858cd060>
predictr.leaderboard()
| | model | score_val | pred_time_val | fit_time | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
|---|---|---|---|---|---|---|---|---|---|
| 0 | WeightedEnsemble_L2 | 0.849162 | 0.043977 | 12.163954 | 0.000984 | 0.653803 | 2 | True | 14 |
| 1 | NeuralNetTorch | 0.843575 | 0.032010 | 10.867509 | 0.032010 | 10.867509 | 1 | True | 12 |
| 2 | LightGBM | 0.837989 | 0.010982 | 0.642641 | 0.010982 | 0.642641 | 1 | True | 4 |
| 3 | LightGBMLarge | 0.832402 | 0.006009 | 0.821788 | 0.006009 | 0.821788 | 1 | True | 13 |
| 4 | CatBoost | 0.832402 | 0.006051 | 7.597862 | 0.006051 | 7.597862 | 1 | True | 7 |
| 5 | XGBoost | 0.826816 | 0.013022 | 0.450137 | 0.013022 | 0.450137 | 1 | True | 11 |
| 6 | NeuralNetFastAI | 0.826816 | 0.017003 | 1.949074 | 0.017003 | 1.949074 | 1 | True | 10 |
| 7 | LightGBMXT | 0.821229 | 0.006997 | 0.471555 | 0.006997 | 0.471555 | 1 | True | 3 |
| 8 | RandomForestEntr | 0.810056 | 0.063482 | 0.526611 | 0.063482 | 0.526611 | 1 | True | 6 |
| 9 | RandomForestGini | 0.804469 | 0.061717 | 0.544051 | 0.061717 | 0.544051 | 1 | True | 5 |
| 10 | ExtraTreesEntr | 0.804469 | 0.064033 | 0.519959 | 0.064033 | 0.519959 | 1 | True | 9 |
| 11 | ExtraTreesGini | 0.798883 | 0.064803 | 0.533057 | 0.064803 | 0.533057 | 1 | True | 8 |
| 12 | KNeighborsUnif | 0.653631 | 0.006997 | 0.011003 | 0.006997 | 0.011003 | 1 | True | 1 |
| 13 | KNeighborsDist | 0.648045 | 0.031997 | 0.016009 | 0.031997 | 0.016009 | 1 | True | 2 |
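The leaderboard can also be read as an accuracy versus cost trade-off. Below is a minimal sketch in plain Python, using the `score_val` and `fit_time` values copied from the table above; the helper `fastest_within` and the tolerance of 0.02 are hypothetical choices made for illustration, not part of AutoGluon.

```python
# (model, score_val, fit_time in seconds), copied from the leaderboard above.
leaderboard = [
    ("WeightedEnsemble_L2", 0.849162, 12.163954),
    ("NeuralNetTorch", 0.843575, 10.867509),
    ("LightGBM", 0.837989, 0.642641),
    ("LightGBMLarge", 0.832402, 0.821788),
    ("CatBoost", 0.832402, 7.597862),
    ("XGBoost", 0.826816, 0.450137),
    ("NeuralNetFastAI", 0.826816, 1.949074),
    ("LightGBMXT", 0.821229, 0.471555),
    ("RandomForestEntr", 0.810056, 0.526611),
    ("RandomForestGini", 0.804469, 0.544051),
    ("ExtraTreesEntr", 0.804469, 0.519959),
    ("ExtraTreesGini", 0.798883, 0.533057),
    ("KNeighborsUnif", 0.653631, 0.011003),
    ("KNeighborsDist", 0.648045, 0.016009),
]


def fastest_within(rows, tolerance):
    """Hypothetical helper: the fastest-to-fit model whose validation
    score is within `tolerance` of the best validation score."""
    best_score = max(score for _, score, _ in rows)
    candidates = [row for row in rows if best_score - row[1] <= tolerance]
    return min(candidates, key=lambda row: row[2])


pick = fastest_within(leaderboard, tolerance=0.02)
print(f"{pick[0]}: score_val={pick[1]}, fit_time={pick[2]}s")
```

With a tolerance of 0.02 this picks LightGBM (0.837989 after roughly 0.64s of fitting), while a tolerance of 0 falls back to WeightedEnsemble_L2.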
4) Prediction (predict)
(_tr.Survived == predictr.predict(_tr)).mean()
0.9472502805836139
predictr.predict(_tr)
0 0
1 1
2 1
3 1
4 0
..
886 0
887 1
888 0
889 1
890 0
Name: Survived, Length: 891, dtype: int64
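Note that the 0.947 above is accuracy measured on the training rows themselves, while the leaderboard's `score_val` (0.849162 for WeightedEnsemble_L2) comes from validation data. A small sketch of the arithmetic, using only numbers printed above; the variable names are made up for illustration.

```python
n_train = 891                    # "Length: 891" in the prediction output
train_acc = 0.9472502805836139   # accuracy on the training rows
best_val = 0.849162              # score_val of WeightedEnsemble_L2

# Convert the training accuracy back into a count of rows.
n_correct = round(train_acc * n_train)
n_wrong = n_train - n_correct

# How much more optimistic the training accuracy is than validation.
gap = train_acc - best_val

print(f"{n_correct} correct, {n_wrong} wrong on the training rows")
print(f"training accuracy exceeds validation accuracy by {gap:.3f}")
```

A gap of this size is a reminder that the training accuracy should not be read as an estimate of the submission score.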
_tst.assign(Survived = predictr.predict(_tst)).loc[:, ['PassengerId', 'Survived']]\
.to_csv('autogluon(Fsize,Drop)_submission.csv', index = False)
This is the highest score so far!
- This can be read as the result of mitigating the multicollinearity problem… hmm.
No, that's not enough. Improve it further!!!
### C. best_quality
1) data
tr = TabularDataset("./data/train.csv")
tst = TabularDataset("./data/test.csv")
Loaded data from: ./data/train.csv | Columns = 12 / 12 | Rows = 891 -> 891
Loaded data from: ./data/test.csv | Columns = 11 / 11 | Rows = 418 -> 418
2) Create the predictor
predictr = TabularPredictor("Survived")
No path specified. Models will be saved in: "AutogluonModels\ag-20231017_132948"
3) Fit
Whatever resources it takes, I'll supply them all, so produce the best quality you can!!
predictr.fit(tr, presets = 'best_quality')
Presets specified: ['best_quality']
Stack configuration (auto_stack=True): num_stack_levels=0, num_bag_folds=8, num_bag_sets=1
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels\ag-20231017_132948"
AutoGluon Version: 0.8.2
Python Version: 3.10.13
Operating System: Windows
Platform Machine: AMD64
Platform Version: 10.0.19045
Disk Space Avail: 57.53 GB / 255.01 GB (22.6%)
Train Data Rows: 891
Train Data Columns: 11
Label Column: Survived
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
2 unique label values: [0, 1]
If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping: class 1 = 1, class 0 = 0
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
Available Memory: 1996.57 MB
Train Data (Original) Memory Usage: 0.31 MB (0.0% of available memory)
Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
Stage 1 Generators:
Fitting AsTypeFeatureGenerator...
Note: Converting 1 features to boolean dtype as they only contain 2 unique values.
Stage 2 Generators:
Fitting FillNaFeatureGenerator...
Stage 3 Generators:
Fitting IdentityFeatureGenerator...
Fitting CategoryFeatureGenerator...
Fitting CategoryMemoryMinimizeFeatureGenerator...
Fitting TextSpecialFeatureGenerator...
Fitting BinnedFeatureGenerator...
Fitting DropDuplicatesFeatureGenerator...
Fitting TextNgramFeatureGenerator...
Fitting CountVectorizer for text features: ['Name']
CountVectorizer fit with vocabulary size = 8
Warning: Due to memory constraints, ngram feature count is being reduced. Allocate more memory to maximize model quality.
Reducing Vectorizer vocab size from 8 to 4 to avoid OOM error
Stage 4 Generators:
Fitting DropUniqueFeatureGenerator...
Stage 5 Generators:
Fitting DropDuplicatesFeatureGenerator...
Types of features in original data (raw dtype, special dtypes):
('float', []) : 2 | ['Age', 'Fare']
('int', []) : 4 | ['PassengerId', 'Pclass', 'SibSp', 'Parch']
('object', []) : 4 | ['Sex', 'Ticket', 'Cabin', 'Embarked']
('object', ['text']) : 1 | ['Name']
Types of features in processed data (raw dtype, special dtypes):
('category', []) : 3 | ['Ticket', 'Cabin', 'Embarked']
('float', []) : 2 | ['Age', 'Fare']
('int', []) : 4 | ['PassengerId', 'Pclass', 'SibSp', 'Parch']
('int', ['binned', 'text_special']) : 9 | ['Name.char_count', 'Name.word_count', 'Name.capital_ratio', 'Name.lower_ratio', 'Name.special_ratio', ...]
('int', ['bool']) : 1 | ['Sex']
('int', ['text_ngram']) : 5 | ['__nlp__.miss', '__nlp__.mr', '__nlp__.mrs', '__nlp__.william', '__nlp__._total_']
0.4s = Fit runtime
11 features in original data used to generate 24 features in processed data.
Train Data (Processed) Memory Usage: 0.07 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.39s ...
AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'
To change this, specify the eval_metric parameter of Predictor()
User-specified model hyperparameters to be fit:
{
'NN_TORCH': {},
'GBM': [{'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}}, {}, 'GBMLarge'],
'CAT': {},
'XGB': {},
'FASTAI': {},
'RF': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
'XT': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
'KNN': [{'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}}, {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}}],
}
Fitting 13 L1 models ...
Fitting model: KNeighborsUnif_BAG_L1 ...
0.6296 = Validation score (accuracy)
0.01s = Training runtime
0.02s = Validation runtime
Fitting model: KNeighborsDist_BAG_L1 ...
0.6352 = Validation score (accuracy)
0.01s = Training runtime
0.0s = Validation runtime
Fitting model: LightGBMXT_BAG_L1 ...
Will use sequential fold fitting strategy because import of ray failed. Reason: ray is required to train folds in parallel for TabularPredictor or HPO for MultiModalPredictor. A quick tip is to install via `pip install ray==2.6.3`
Fitting 8 child models (S1F1 - S1F8) | Fitting with SequentialLocalFoldFittingStrategy
0.835 = Validation score (accuracy)
3.82s = Training runtime
0.05s = Validation runtime
Fitting model: LightGBM_BAG_L1 ...
Fitting 8 child models (S1F1 - S1F8) | Fitting with SequentialLocalFoldFittingStrategy
0.8373 = Validation score (accuracy)
5.36s = Training runtime
0.06s = Validation runtime
Fitting model: RandomForestGini_BAG_L1 ...
0.8339 = Validation score (accuracy)
0.55s = Training runtime
0.1s = Validation runtime
Fitting model: RandomForestEntr_BAG_L1 ...
0.8305 = Validation score (accuracy)
0.54s = Training runtime
0.1s = Validation runtime
Fitting model: CatBoost_BAG_L1 ...
Fitting 8 child models (S1F1 - S1F8) | Fitting with SequentialLocalFoldFittingStrategy
0.8552 = Validation score (accuracy)
72.17s = Training runtime
0.04s = Validation runtime
Fitting model: ExtraTreesGini_BAG_L1 ...
0.8238 = Validation score (accuracy)
0.51s = Training runtime
0.11s = Validation runtime
Fitting model: ExtraTreesEntr_BAG_L1 ...
0.8316 = Validation score (accuracy)
0.49s = Training runtime
0.1s = Validation runtime
Fitting model: NeuralNetFastAI_BAG_L1 ...
Fitting 8 child models (S1F1 - S1F8) | Fitting with SequentialLocalFoldFittingStrategy
No improvement since epoch 7: early stopping
No improvement since epoch 6: early stopping
No improvement since epoch 7: early stopping
0.853 = Validation score (accuracy)
20.42s = Training runtime
0.13s = Validation runtime
Fitting model: XGBoost_BAG_L1 ...
Fitting 8 child models (S1F1 - S1F8) | Fitting with SequentialLocalFoldFittingStrategy
0.8373 = Validation score (accuracy)
3.6s = Training runtime
0.06s = Validation runtime
Fitting model: NeuralNetTorch_BAG_L1 ...
Fitting 8 child models (S1F1 - S1F8) | Fitting with SequentialLocalFoldFittingStrategy
0.8462 = Validation score (accuracy)
68.5s = Training runtime
0.19s = Validation runtime
Fitting model: LightGBMLarge_BAG_L1 ...
Fitting 8 child models (S1F1 - S1F8) | Fitting with SequentialLocalFoldFittingStrategy
0.8429 = Validation score (accuracy)
8.68s = Training runtime
0.06s = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
0.8552 = Validation score (accuracy)
0.84s = Training runtime
0.0s = Validation runtime
AutoGluon training complete, total runtime = 188.35s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels\ag-20231017_132948")
<autogluon.tabular.predictor.predictor.TabularPredictor at 0x20d90435000>
The trade-off is that it takes considerably longer: 188.35s in total, versus 26.66s for the previous run…
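One reason for the longer runtime is visible in the log: with `num_bag_folds=8`, several models report "Fitting 8 child models (S1F1 - S1F8)", so each of those is trained eight times on different subsets of the rows. The sketch below only illustrates the idea of 8-fold bagging on 891 rows. It assumes a plain contiguous split, which is a simplification and not AutoGluon's actual splitting logic; `fold_indices` is a hypothetical helper.

```python
def fold_indices(n_rows, n_folds):
    """Split row indices 0..n_rows-1 into n_folds contiguous folds
    whose sizes differ by at most one (illustrative only)."""
    base, extra = divmod(n_rows, n_folds)
    folds, start = [], 0
    for i in range(n_folds):
        size = base + (1 if i < extra else 0)
        folds.append(range(start, start + size))
        start += size
    return folds


n_rows = 891
folds = fold_indices(n_rows, n_folds=8)

# Each child model is fit on 7 folds and validated on the remaining one.
for k, val_rows in enumerate(folds, start=1):
    n_val = len(val_rows)
    print(f"child model F{k}: fit on {n_rows - n_val} rows, validate on {n_val} rows")
```

Because every row lands in exactly one validation fold, the validation predictions taken together cover all 891 training rows.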
- Check the leaderboard
predictr.leaderboard()
| | model | score_val | pred_time_val | fit_time | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
|---|---|---|---|---|---|---|---|---|---|
| 0 | CatBoost_BAG_L1 | 0.855219 | 0.036927 | 72.167391 | 0.036927 | 72.167391 | 1 | True | 7 |
| 1 | WeightedEnsemble_L2 | 0.855219 | 0.038929 | 73.009209 | 0.002002 | 0.841818 | 2 | True | 14 |
| 2 | NeuralNetFastAI_BAG_L1 | 0.852974 | 0.130997 | 20.415231 | 0.130997 | 20.415231 | 1 | True | 10 |
| 3 | NeuralNetTorch_BAG_L1 | 0.846240 | 0.194014 | 68.497755 | 0.194014 | 68.497755 | 1 | True | 12 |
| 4 | LightGBMLarge_BAG_L1 | 0.842873 | 0.056998 | 8.680638 | 0.056998 | 8.680638 | 1 | True | 13 |
| 5 | XGBoost_BAG_L1 | 0.837262 | 0.055978 | 3.598592 | 0.055978 | 3.598592 | 1 | True | 11 |
| 6 | LightGBM_BAG_L1 | 0.837262 | 0.061885 | 5.357185 | 0.061885 | 5.357185 | 1 | True | 4 |
| 7 | LightGBMXT_BAG_L1 | 0.835017 | 0.049997 | 3.816595 | 0.049997 | 3.816595 | 1 | True | 3 |
| 8 | RandomForestGini_BAG_L1 | 0.833895 | 0.096996 | 0.553528 | 0.096996 | 0.553528 | 1 | True | 5 |
| 9 | ExtraTreesEntr_BAG_L1 | 0.831650 | 0.095051 | 0.494969 | 0.095051 | 0.494969 | 1 | True | 9 |
| 10 | RandomForestEntr_BAG_L1 | 0.830527 | 0.101071 | 0.535026 | 0.101071 | 0.535026 | 1 | True | 6 |
| 11 | ExtraTreesGini_BAG_L1 | 0.823793 | 0.111044 | 0.513969 | 0.111044 | 0.513969 | 1 | True | 8 |
| 12 | KNeighborsDist_BAG_L1 | 0.635241 | 0.004996 | 0.006006 | 0.004996 | 0.006006 | 1 | True | 2 |
| 13 | KNeighborsUnif_BAG_L1 | 0.629630 | 0.015998 | 0.005992 | 0.015998 | 0.005992 | 1 | True | 1 |
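CatBoost_BAG_L1 and WeightedEnsemble_L2 share exactly the same `score_val` here. In AutoGluon's leaderboard, `fit_time` and `pred_time_val` of an ensemble include its base models, while the `*_marginal` columns count only the ensemble step itself. The arithmetic sketch below, using values copied from the table, suggests that the ensemble's base-model time matches CatBoost_BAG_L1 alone. This is an inference from the timings, not something the log states.

```python
# Values copied from the leaderboard above.
catboost = {"fit_time": 72.167391, "pred_time_val": 0.036927}
ensemble = {
    "fit_time": 73.009209,
    "fit_time_marginal": 0.841818,
    "pred_time_val": 0.038929,
    "pred_time_val_marginal": 0.002002,
}

# Time the ensemble inherits from its base models.
base_fit = ensemble["fit_time"] - ensemble["fit_time_marginal"]
base_pred = ensemble["pred_time_val"] - ensemble["pred_time_val_marginal"]

print(f"inherited by the ensemble: fit {base_fit:.6f}s, predict {base_pred:.6f}s")
print(f"CatBoost_BAG_L1 alone:     fit {catboost['fit_time']:.6f}s, "
      f"predict {catboost['pred_time_val']:.6f}s")
```

If that reading is right, the weighted ensemble is effectively relying on CatBoost_BAG_L1 only, which would explain the identical score of 0.855219.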
4) Prediction (predict)
(tr.Survived == predictr.predict(tr)).mean()
0.9158249158249159
tst[['PassengerId']].assign(Survived = predictr.predict(tst))\
.to_csv("autogluon(best_quality)_submission.csv",index=False)
But the payoff is clear: a score of no less than 0.813…
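Before uploading a file like the one written above, a quick sanity check can catch formatting slips. The sketch below uses only the standard library. `check_submission` is a hypothetical helper, and the expectations it encodes (columns `PassengerId` and `Survived`, 418 rows, values 0 or 1) are taken from the code and the "Rows = 418" log line in this notebook.

```python
import csv


def check_submission(path, expected_rows=418):
    """Return a list of problems found in a submission CSV (empty if none)."""
    problems = []
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames != ["PassengerId", "Survived"]:
            problems.append(f"unexpected columns: {reader.fieldnames}")
            return problems
        rows = list(reader)
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    bad = [row for row in rows if row["Survived"] not in {"0", "1"}]
    if bad:
        problems.append(f"{len(bad)} rows have a Survived value other than 0 or 1")
    return problems
```

Calling `check_submission("autogluon(best_quality)_submission.csv")` should return an empty list if the file has the expected shape.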